# Tensor Virtual Machine

Chen et al. *Presented by Sultan Durrani, Ryan Ziegler*

# Agenda

- Background and Motivation
- Machine learning compilers
- Existing Work (Halide)
- TVM design and optimizations
- Experimentation and evaluation
- **•** Strengths and Weaknesses
- **Future directions**

### How do (CPU) compilers work?

- "Lowerˮ high-level (i.e. C) code into *basic blocks*
- A basic block contains no control flow
- Basic blocks are connected by edges representing control flow
- We can follow these edges to trace dataflow
- SSA IR: values are written to *exactly* once
- Optimization passes transform the IR while preserving program semantics



# Compilers, more generally

- Take an input program, convert it into some representation
- Transform the representation to improve performance while preserving semantics  $func(int)$ :
- Output code (i.e. ASM)

```
int func(int x) {
    return x + 2;
int \text{main}()func(3);
```
●



dword ptr [rbp - 4], edi mov eax, dword ptr [rbp - 4] mov eax, 2 add rbp pop ret main: push rbp rbp, rsp  $mov$ edi, 3 mov call  $func(int)$ eax, eax xor rbp pop ret

push

mov

rbp rbp, rsp

Source C Program

Compiled with -O0

#### Traditional Compilers



#### Machine Learning/Deep Learning Compilers



#### Challenges with Machine Learning compilers

- Need to learn how to use new hardware features and accelerators. For example H100(hopper) introduced wgmma instructions. Different from mma
- Large search space for optimization Need to produce efficient code without manual tuning (huge configuration space)



#### TVM Overview

- TVM takes the IR of ML frameworks and generates a compute graph
- The compute graph contains operators as nodes and edges between them representing data dependencies







#### TVM Optimizations Overview

- Optimizing Tensor Operations
- Optimizing Computation Graphs
- Automating Optimizations with ML cost model

#### Halide

• Problem: highly parallel matrix operations are difficult to express

```
void blur (const Image<uint16 t> &in, Image<uint16 t> &bv) {
 Image\text{-}uint16 \t > bh(in width(), in height())
```

```
for (int y = 0; y < in. height(); y++)
 for (int x = 0; x < in. width(); x++)
  bh(x, y) = (\text{in}(x-1, y) + \text{in}(x, y) + \text{in}(x+1, y))/3;
```

```
for (int y = 0; y < in. height(); y++)
 for (int x = 0; x < in. width (); x++)
  bv(x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1))/3;
```

```
Naive image blur
```

```
void fast blur (const Image<uint16 t> &in, Image<uint16 t> &bv) {
 ml28i one third = mm set1 epi16(21846);
 #pragma omp parallel for
 for (int yTile = 0; yTile < in.height(); yTile += 32) {
  m128i a, b, c, sum, avg;
  m128i bh[(256/8)*(32+2)];
  for (int xTile = 0; xTile < in.width(); xTile += 256) {
    m128i *bhPtr = bhfor (int y = -1; y < 32+1; y++) {
    const uint16_t *inPtr = \&(in(xTile, yTile+y));
    for (int x = 0; x < 256; x == 8) {
       a = mm loadu si128(( m128i*)(inPtr - 1));
       b = mm loadu si128((m128i*)(inPtr + 1));
       c = mm load sil28 ((m128i*)(inPtr));
     sum = mm add epi16( mm add epi16(a, b), c);
    avg = mm mulhi epi16(sum, one third);
     mm store sil28(bhPtr++, avg);
    inPtr := 8;\}bhPtr = bh;
   for (int y = 0; y < 32; y++) {
     m128i *outPtr = ( m128i *) (& (bv(xTile, yTile+y));
    for (int x = 0; x < 256; x == 8) {
       a = mm load si128 (bhPtr + (256 * 2) / 8);
       b = mm load si128 (bhPtr + 256 / 8);
       c = mm load si128 (bhPtr++);
     sum = mm add epi16(mm add epi16(a, b), c);
     avg = mm mulhi epi16(sum, one third);
     mm store sil28(outPtr++, avg);
{}}}}}}
                  Hand-optimized blur
```
#### Halide

- Solution: separate the algorithm from the *schedule* (tiling behavior, vectorization [width], loop ordering, etc)
- Halide takes an algorithm and schedule, and generates code implementing the schedule

```
Func halide blur (Func in) {
Func bh, by:
Var x, y, xi, yi;
// The algorithm
bh(x, y) = (\text{in}(x-1, y) + \text{in}(x, y) + \text{in}(x+1, y))/3;
by (x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1))/3;
// The schedule
bv.tile(x, y, xi, yi, 256, 32)
   vectorize(xi, 8) .parallel(y);bh.compute at (bv, x). vectorize (x, 8);
 return by;
```
#### Why Halide?

- It's easier to optimize an algorithm decoupled from an execution schedule
- Algorithms can be expressed more concisely

```
Func halide blur (Func in) {
Func bh, by:
Var x, y, xi, yi;
// The algorithm
bh(x, y) = (\text{in}(x-1, y) + \text{in}(x, y) + \text{in}(x+1, y))/3;
by (x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1))/3;
// The schedule
bv.tile(x, y, xi, yi, 256, 32)
   vectorize(xi, 8) .parallel(y);bh.compute at (bv, x). vectorize (x, 8);
 return by;
```
#### Why not Halide?

- Execution schedules have a high impact on algorithm runtime
- Difficult to optimize ML because schedules are not graph-level (i.e. cannot use schedules for fusion)
- $\bullet$  Lower-level:  $C++$  embedded DSL

```
Func halide blur (Func in) {
 Func bh, by:
Var x, y, xi, yi;
// The algorithm
bh(x, y) = (\text{in}(x-1, y) + \text{in}(x, y) + \text{in}(x+1, y))/3;
by (x, y) = (bh(x, y-1) + bh(x, y) + bh(x, y+1))/3;
// The schedule
bv.tile(x, y, xi, yi, 256, 32)
   vectorize(xi, 8) .parallel(y);bh.compute at (bv, x). vectorize (x, 8);
 return by;
```
#### **TVM: Tensor Expression DSL**

```
A = t.placeholder((1024, 1024))B = t.placeholder((1024, 1024))k = t. reduce_axis((0, 1024))
 C = t.compute((1024, 1024), lambda y, x:
                 t.sum(A[k, y] * B[k, x], axis=k))
 s = t.create_schedule(C.op)
    for y in range(1024):
      for x in range(1024):
        C[y][x] = 0|-+for k in range(1024):
          C[y][x] += A[k][y] * B[k][x]
```
Very similar to Halide

- Specify the algorithm
- Specify the schdule

#### Cooperation

- Traditional nested parallelism: threads do not access one another's memory
- Cooperative parallelism: all threads fetch data they all need, allows for sharing common data
- TVM implements memory *scopes*: a compute stage can be marked as shared, and the compiler will generate cooperative code

```
for thread group (by, bx) in \text{cross}(64, 64):
                                                All threads cooperatively
  for thread item (ty, tx) in \csc(2, 2):
                                                load AS and BS in different
    local CL[8][8] = 0shared AS[2][8], BS[2][8]
                                                parallel patterns
    for k in range(1024):
      for i in range(4):
        AS[ty][i*4+tx] = A[k][by*64+ty*8+i*4+tx]for each i in 0.4:
        BS[ty][i*4+tx] = B[k][bx*64+ty*8+i*4+tx]
      memory barrier among threads ()
                                                    Barrier inserted
      for yi in range(8):
                                                    automatically
        for xi in range(8):
                                                    by compiler
          CL[yi][xi] += AS[yi] * BS[xi]for yi in range(8):
        for xi in range(8):
          C[yo*8+yi][xo*8+xi] = CL[yi][xi]
```
#### Tensorization

- This is analogous to vectorization on SIMD architectures
- Input instructions are multi dimensional which dictate specific layouts
- Not restricted to a fixed set of primitives, each DL accelerator could potentially have their own flavors of Tensor instructions
- TVM makes tensorization extensible, decouple hardware intrinsic from schedule
- Adds a tensorize primitive to make use of hand crafted micro kernels





Figure 34: MMA .m8n8k16 fragment layout for accumulator matrix C/D with .s32 type

# Explicit Memory latency hiding

- Refers to overlapping memory operations with computations
- In CPU, can be achieved via hardware prefetching or SMT
- In CUDA, we have async memory copies (TMA on H100
- TVM adds virtual threading to transform the program to a single instruction stream





#### Computational graph Optimizations

- Operator implementations are unspecified
- Only: inputs, operations, dependencies
- All dimensions typically known statically
- A computation graph is analogous to a Halide *algorithm*



#### Operator Fusion

- Operator fusion refers to combining multiple operators into one
- Operators are fused following four rules:
	- Injective operators can be fused
	- Reductions may be fused to an injective operator
	- "Complexˮ operators (i.e. conv2d) can be fused with element-wise maps after
	- "Opaqueˮ operators (i.e. sort) cannot be fused



#### Constant folding and data layout transformation

- Constant folding: if some operators have static inputs, compute their output at compile time
- Data layout transformation: adjust how tensors are stored (row major, blocks, …) depending on target device(s)
- Memory planning: adjust memory layout based on characteristics of target device CPU, GPU, custom) to ensure locality



#### Cost model

- How do we decide what optimizations to make?
- Solution: use ML to determine the projected cost (positive or negative) of making a specific optimization
	- Use simulated annealing to perform optimization using model as a cost metric





# Experimentation

- TVM workload optimization over multiple platforms
- **•** TVM vs existing DL frameworks
- TVM support for new DL operations and workloads
- TVM support for specialized accelerators



GPU end-to-end evaluation for TVM, Figure  $14$ : MXNet, Tensorflow, and Tensorflow XLA. Tested on the NVIDIA Titan X.

Server class GPU



Embedded CPU

Figure 16: ARM A53 end-to-end evaluation of TVM and TFLite.



Figure 17: Relative speedup of all conv2d operators in ResNet-18 and all depthwise conv2d operators in mobilenet. Tested on ARM A53. See Table 2 for the configurations of these operators.



Figure 18: Relative speedup of single- and multithreaded low-precision conv2d operators in ResNet. Baseline was a single-threaded, hand-optimized implementation from Caffe2 (commit: 39e07f7). C5, C3 are 1x1 convolutions that have less compute intensity, resulting in less speedup by multi-threading.



Embedded GPU

Figure 19: End-to-end experiment results on Mali-Two data types, float32 and float16, were T860MP4. evaluated.

#### TVM showcase on a custom accelerator



Figure 21: We offloaded convolutions in the ResNet workload to an FPGA-based accelerator. The grayed-out bars correspond to layers that could not be accelerated by the FPGA and therefore had to run on the CPU. The FPGA provided a 40x acceleration on offloaded convolution layers over the Cortex A9.





#### Strengths

- Can generate code for many backends including new accelerators (ex FPGA based one)
- Open source implementation
- Supports popular frameworks like pytorch, tensorflow etc
- Demonstrates performance at par or even better in cases than hand tuned kernel libraries

#### Related Work

- Halide
- TensorFlow XLA
- FFTW and ATLAS

### Limitations of TVM

- ML cost model requires training, which can be slow or costly
	- Optimization search space is extremely wide
- For better performance, more sophisticated operator fusion decision making required.
- Fragmented code base. Model definition in python while operators in Cuda/C++, programmer needs to be familiar with both and also have to learn TVM expression
- Advanced optimizations require understanding of TVM IR, no easy way to do operator extensibility

#### Future Work

- Improve Graph level optimization by using a using some better heuristics for Operator Fusion. Can do static analysis on the kernels and keep note of instructions, blocks, loads, math functions, barriers etc
- Work on some pruning strategies to reduce the search space for the ML cost model
- Support for heterogeneous optimizations. Graph partitioning across multiple different devices, and fuse + map operators based on device affinity